Hardware Utilization


TETRIS: TilE-matching the TRemendous Irregular Sparsity

Yu Ji, Ling Liang, Lei Deng, Youyang Zhang, Youhui Zhang, Yuan Xie

Neural Information Processing Systems

Compressing neural networks by pruning weights with small magnitudes can significantly reduce the computation and storage cost. Although pruning makes the model smaller, it is difficult to get a practical speedup in modern computing platforms such as CPU and GPU due to the irregularity.
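The pruning step the abstract refers to can be sketched in a few lines. This is a generic illustration of magnitude-based weight pruning, not TETRIS itself; the function name and the strict-threshold tie handling are my own choices:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitudes."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)          # number of weights to drop
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weights) > threshold     # keep only strictly larger magnitudes
    return weights * mask

w = np.array([[0.9, -0.05, 0.3],
              [-0.01, 0.7, -0.2]])
pruned = magnitude_prune(w, 0.5)           # half the weights are zeroed
```

The resulting zeros land wherever the small magnitudes happen to be, which is exactly the irregular sparsity pattern that makes dense CPU/GPU kernels hard to speed up.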





Appendix A Latency Driven Slimming Algorithm

Neural Information Processing Systems

We provide the details of the proposed latency-driven fast slimming in Alg. 1. Our major conclusions and speed analysis can be found in Sec. 3 and Figure 1. Compared to non-overlap large-kernel patch embedding (V5 in Tab. 3), MHSA with the global receptive field is an essential contribution to model performance. By comparing V1 and V2 in Tab. 3, we can observe the effect of GN. We explore ReLU and HardSwish (V3 and V4 in Tab. 3) in addition to GeLU. We conclude that the activation function can be selected on a case-by-case basis depending on the specific hardware and compiler. In this work, we use GeLU to provide better performance than ReLU while executing faster.


EfficientLLM: Efficiency in Large Language Models

Yuan, Zhengqing, Sun, Weixiang, Liu, Yixin, Zhou, Huichi, Zhou, Rong, Li, Yiyang, Zhang, Zheyuan, Song, Wei, Huang, Yue, Jia, Haolong, Murugesan, Keerthiram, Wang, Yu, He, Lifang, Gao, Jianfeng, Sun, Lichao, Ye, Yanfang

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale. Conducted on a production-class cluster (48xGH200, 8xH200 GPUs), our study systematically explores three key axes: (1) architecture pretraining (efficient attention variants: MQA, GQA, MLA, NSA; sparse Mixture-of-Experts (MoE)), (2) fine-tuning (parameter-efficient methods: LoRA, RSLoRA, DoRA), and (3) inference (quantization methods: int4, float16). We define six fine-grained metrics (Memory Utilization, Compute Utilization, Latency, Throughput, Energy Consumption, Compression Rate) to capture hardware saturation, latency-throughput balance, and carbon cost. Evaluating over 100 model-technique pairs (0.5B-72B parameters), we derive three core insights: (i) Efficiency involves quantifiable trade-offs: no single method is universally optimal; e.g., MoE reduces FLOPs and improves accuracy but increases VRAM by 40%, while int4 quantization cuts memory/energy by up to 3.9x at a 3-5% accuracy drop. (ii) Optima are task- and scale-dependent: MQA offers optimal memory-latency trade-offs for constrained devices, MLA achieves lowest perplexity for quality-critical tasks, and RSLoRA surpasses LoRA efficiency only beyond 14B parameters. (iii) Techniques generalize across modalities: we extend evaluations to Large Vision Models (Stable Diffusion 3.5, Wan 2.1) and Vision-Language Models (Qwen2.5-VL), confirming effective transferability. By open-sourcing datasets, evaluation pipelines, and leaderboards, EfficientLLM provides essential guidance for researchers and engineers navigating the efficiency-performance landscape of next-generation foundation models.
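Of the techniques the study compares, int4 quantization is the easiest to sketch. The snippet below shows generic symmetric per-tensor round-to-nearest int4 quantization as an illustration of the idea; it is not EfficientLLM's evaluation pipeline, and the function names are my own:

```python
import numpy as np

def quantize_int4(x):
    """Symmetric per-tensor int4: map floats to integers in [-8, 7]."""
    scale = np.max(np.abs(x)) / 7.0                  # 7 = largest positive int4 value
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)  # int4 values stored in int8
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.array([0.05, -0.7, 0.31, 1.4], dtype=np.float32)
q, s = quantize_int4(x)
x_hat = dequantize(q, s)                              # rounding error is at most scale/2
```

Four bits per weight gives the roughly 4x memory compression behind the "up to 3.9x" memory/energy reduction quoted above, with the rounding error being the source of the reported 3-5% accuracy drop.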


ESSR: An 8K@30FPS Super-Resolution Accelerator With Edge Selective Network

Hsu, Chih-Chia, Chang, Tian-Sheuan

arXiv.org Artificial Intelligence

Deep learning-based super-resolution (SR) is challenging to implement in resource-constrained edge devices for resolutions beyond full HD due to its high computational complexity and memory bandwidth requirements. This paper introduces an 8K@30FPS SR accelerator with edge-selective dynamic input processing. Dynamic processing chooses the appropriate subnets for different patches based on simple input edge criteria, achieving a 50% MAC reduction with only a 0.1dB PSNR decrease. Reconstruction quality is guaranteed, and its potential maximized, by resource-adaptive model switching even under resource constraints. In conjunction with hardware-specific refinements, the model size is reduced by 84% to 51K, with a decrease of less than 0.6dB PSNR. Additionally, to support dynamic processing with high utilization, this design incorporates a configurable group of layer mappings that synergizes with the structure-friendly fusion block, resulting in 77% hardware utilization and up to a 79% reduction in feature SRAM access. The implementation, using the TSMC 28nm process, achieves 8K@30FPS throughput at 800MHz with a gate count of 2749K, 0.2075W power consumption, and 4797Mpixels/J energy efficiency, exceeding previous work. Deep learning-based super-resolution has gained prominence in recent years due to its exceptional performance. The growing demand for high-definition (HD), ultra-HD, or even 8K images in various edge-device applications, including surveillance, medical imaging, virtual reality, and digital entertainment, underscores its importance. Consequently, there is a pressing need for efficient hardware accelerators. Various hardware accelerators have been proposed in recent years [1]-[5] for HD applications.
However, due to the extensive computational demands and significant memory bandwidth requirements, many existing super-resolution accelerators opt for simplistic and extremely lightweight models, such as FSRCNN [6] or 1-D convolution [2], as their backbone. This often results in a compromise in both performance and perceptual quality. This work was supported by the National Science and Technology Council, Taiwan, under Grant 111-2622-8-A49-018-SB, 110-2221-E-A49-148-MY3, and 110-2218-E-A49-015-MBK.
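The "simple input edge criteria" driving ESSR's subnet selection can be sketched as follows. The gradient-based edge score and the threshold value are illustrative assumptions on my part, not the paper's exact criterion:

```python
import numpy as np

def edge_score(patch):
    """Mean absolute gradient as a simple edge criterion (illustrative)."""
    gx = np.abs(np.diff(patch, axis=1)).mean()  # horizontal gradients
    gy = np.abs(np.diff(patch, axis=0)).mean()  # vertical gradients
    return gx + gy

def select_subnet(patch, threshold=0.1):
    # Flat patches go to a lightweight subnet; edge-rich patches get the full one,
    # which is where the MAC reduction on easy regions comes from.
    return "full" if edge_score(patch) > threshold else "light"

flat = np.full((8, 8), 0.5)            # uniform patch, no edges
edges = np.tile([0.0, 1.0], (8, 4))    # alternating columns, strong edges
```

Because most patches in natural images are smooth, routing them to the light subnet cuts average compute sharply while the full subnet preserves PSNR where it matters.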


T-REX: A 68-567 µs/token, 0.41-3.95 µJ/token Transformer Accelerator with Reduced External Memory Access and Enhanced Hardware Utilization in 16nm FinFET

Moon, Seunghyun, Li, Mao, Chen, Gregory, Knag, Phil, Krishnamurthy, Ram, Seok, Mingoo

arXiv.org Artificial Intelligence

This work introduces novel training and post-training compression schemes to reduce external memory access during transformer model inference. Additionally, a new control flow mechanism, called dynamic batching, and a novel buffer architecture, termed a two-direction accessible register file, further reduce external memory access while improving hardware utilization.


LightMamba: Efficient Mamba Acceleration on FPGA with Quantization and Hardware Co-design

Wei, Renjie, Xu, Songqiang, Zhong, Linfeng, Yang, Zebin, Guo, Qingyu, Wang, Yuan, Wang, Runsheng, Li, Meng

arXiv.org Artificial Intelligence

State space models (SSMs) like Mamba have recently attracted much attention. Compared to Transformer-based large language models (LLMs), Mamba achieves linear computational complexity in the sequence length and demonstrates superior performance. However, Mamba is hard to accelerate due to the scattered activation outliers and the complex computation dependency, rendering existing LLM accelerators inefficient. In this paper, we propose LightMamba, which co-designs the quantization algorithm and FPGA accelerator architecture for efficient Mamba inference. We first propose an FPGA-friendly post-training quantization algorithm that features rotation-assisted quantization and power-of-two SSM quantization to reduce the majority of computation to 4-bit. We further design an FPGA accelerator that partially unrolls the Mamba computation to balance efficiency and hardware costs. Through computation reordering as well as fine-grained tiling and fusion, the hardware utilization and memory efficiency of the accelerator are drastically improved.
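The power-of-two quantization idea is that restricting scale factors to powers of two turns the rescaling multiply into a bit shift in hardware. A minimal sketch of the general technique (the function and its round-in-log-domain rule are illustrative, not LightMamba's exact algorithm):

```python
import numpy as np

def pow2_quantize_scale(scale):
    """Snap a quantization scale to the nearest power of two (in the log domain),
    so dequantization becomes a shift instead of a multiply on FPGA."""
    return 2.0 ** np.round(np.log2(scale))

s = pow2_quantize_scale(0.3)   # 0.3 snaps to 2**-2 = 0.25
```

On hardware, multiplying by 0.25 is a right shift by 2, which is far cheaper than a general fixed-point multiplier per lane.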


Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization

Pang, Bowen, Li, Kai, She, Ruifeng, Wang, Feifan

arXiv.org Artificial Intelligence

With the development of large language models (LLMs), it has become increasingly important to optimize hardware usage and improve throughput. In this paper, we study the inference optimization of the serving system that deploys LLMs. To optimize system throughput and maximize hardware utilization, we formulate the inference optimization problem as a mixed-integer programming (MIP) model and propose a hybrid offline-online method as a solution. The offline method improves large-scale inference systems by introducing a Minimizing Makespan Bin Packing Problem. We further provide a theoretical lower bound computation method. Then, we propose an online sorting and preemptive scheduling method to better utilize hardware. In the online iteration scheduling process, a Lagrangian method is applied to evaluate the cost efficiency of inserting prefill stages versus decode stages at each iteration and dynamically determine when to preempt decoding tasks and insert prefill tasks. Experiments using real-world data from the LLaMA-65B model and the GSM8K dataset demonstrate that system utilization improves from 80.2% to 89.1%, and the total inference time decreases from 201.00 to 190.58 seconds. A 100-case study shows that our method consistently outperforms the baseline method and improves the utilization rate by 8.0% on average. Finally, we discuss potential future extensions, including stochastic modeling, reinforcement learning-based schedulers, and dynamic decision-making strategies for system throughput and hardware utilization. Note to Practitioners: This work provides optimization tools for enhancing the efficiency of LLM inference systems through advanced scheduling techniques. From the perspective of LLM inference service providers, improved hardware utilization can reduce operational costs by requiring less hardware to maintain the same level of service.
From the user's perspective, reduced inference time translates to faster response times and improved service quality. Furthermore, the proposed scheduling techniques are adaptable to various LLM models, hardware platforms, and datasets, making them highly scalable and broadly applicable to real-world LLM inference scenarios. Recent advancements in large language models (LLMs), including GPT-4, LLaMA, and Qwen, have significantly transformed the landscape of natural language processing by enabling more sophisticated text generation, comprehension, and interaction capabilities. These models serve as foundational technologies in a wide range of applications, such as chatbots, machine translation, and content creation. The authors are with Noah's Ark Lab, Huawei.
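The offline stage above casts scheduling as a Minimizing Makespan Bin Packing Problem. As a minimal illustration of that objective (not the paper's MIP formulation), the classic longest-processing-time-first heuristic places each job on the currently least-loaded worker:

```python
import heapq

def lpt_schedule(job_times, n_workers):
    """Greedy longest-processing-time-first makespan minimization.

    An illustrative stand-in for a makespan-minimizing packing step;
    the function name and structure are my own.
    """
    heap = [(0.0, w) for w in range(n_workers)]      # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    # Place the longest jobs first, always onto the least-loaded worker.
    for job, t in sorted(enumerate(job_times), key=lambda p: -p[1]):
        load, w = heapq.heappop(heap)
        assignment[w].append(job)
        heapq.heappush(heap, (load + t, w))
    makespan = max(load for load, _ in heap)
    return assignment, makespan

assignment, makespan = lpt_schedule([5, 3, 3, 2, 2, 1], n_workers=2)
```

LPT carries Graham's classic (4/3 - 1/(3m))·OPT guarantee; an exact MIP model such as the one in this paper trades that approximation for optimality at higher solve cost, which is why the authors pair it with a theoretical lower bound.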